Attention in CNN:
According to [4], attention can be categorized into bottom-up attention (visual saliency, unsupervised) and top-down attention (task-driven, supervised).
According to [5], attention can be categorized into forward attention, post-hoc attention, and query-based attention.
forward attention: spatial attention [16], channel attention [10] [17] [18], full attention [11], Deformable CNN v1 [8] v2 [9],
post-hoc attention: CAM [6], GradCAM [7], scoreCAM [14], trainable CAM [20][21]
query-based attention: [5]
high-order attention [15]
Attention in RNN:
survey paper: survey on the attention based RNN model and its applications in computer vision [1]
soft/hard attention: binary weight or soft weight
item-wise/location-wise attention: location-wise attention is to convert an image to a sequence of local regions, which is essentially item-wise.
Earliest papers [2] [3] are basically the same except design specs of RNN unit.
Reference
[1] Wang, Feng, and David MJ Tax. “Survey on the attention based RNN model and its applications in computer vision.” arXiv preprint arXiv:1601.06823 (2016).
[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
[3] Vinyals, Oriol, et al. “Grammar as a foreign language.” NIPS, 2015.
[4] Drew Linsley, Dan Shiebler, Sven Eberhardt, Thomas Serre: Learning what and where to attend. ICLR, 2019.
[5] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr: Learn to Pay Attention. ICLR, 2018.
[6] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, Antonio Torralba: Learning Deep Features for Discriminative Localization. CVPR 2016.
[7] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra:
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017.
[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei:
Deformable Convolutional Networks. ICCV 2017.
[9] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai: Deformable ConvNets v2: More Deformable, Better Results. CoRR abs/1811.11168 (2018)
[10] Wei Li, Xiatian Zhu, Shaogang Gong: Harmonious Attention Network for Person Re-Identification. CVPR 2018.
[11] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang: Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-Identification. ECCV (4) 2018.
[12] Zintgraf, Luisa M., et al. “Visualizing deep neural network decisions: Prediction difference analysis.” arXiv preprint arXiv:1702.04595 (2017).
[13] Fong, Ruth C., and Andrea Vedaldi. “Interpretable explanations of black boxes by meaningful perturbation.” ICCV, 2017.
[14] Wang, Haofan, et al. “Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping.” arXiv preprint arXiv:1910.01279 (2019).
[15] Chen, Binghui, Weihong Deng, and Jiani Hu. “Mixed high-order attention network for person re-identification.” Proceedings of the IEEE International Conference on Computer Vision. 2019.
[16] Zhu, Xizhou, et al. “An empirical study of spatial attention mechanisms in deep networks.” ICCV, 2019.
[17] Wang, Qilong, et al. “ECA-net: Efficient channel attention for deep convolutional neural networks.” CVPR, 2020.
[18] Qin, Zequn, et al. “FcaNet: Frequency Channel Attention Networks.” arXiv preprint arXiv:2012.11879 (2020).
[19] Zhang, Xiaolin, et al. “Adversarial complementary learning for weakly supervised object localization.” CVPR, 2018.
[20] Jo, Sanhyun, and In-Jae Yu. “Puzzle-CAM: Improved localization via matching partial and full features.” arXiv preprint arXiv:2101.11253 (2021).
[21] Araslanov, Nikita, and Stefan Roth. “Single-stage semantic segmentation from image labels.” CVPR, 2020.